Add Qwen3VL MCore Export support from PR 895#1482
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds bidirectional Megatron Core ↔ Hugging Face weight mappings for Qwen3-VL, registers them as a plugin, includes tests validating import/export symmetry and prefix rules, and updates changelog and deployment docs for Qwen 3‑VL support. ChangesQwen3-VL Megatron Core Integration
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHANGELOG.rst`:
- Line 136: Move the changelog entry "Add Megatron Core export/import mapping
for Qwen3-VL (``Qwen3VLForConditionalGeneration``) vision-language models..."
out of the 0.42 (2026-03-10) section and place it under the current
unreleased/0.45 section header in CHANGELOG.rst, preserving the existing
formatting and inline code markup; ensure you remove the duplicate from 0.42 and
verify the entry appears exactly once under the 0.45 (unreleased) section.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7f30ba6d-8de4-4386-b197-7f8189e56d24
📒 Files selected for processing (5)
CHANGELOG.rstdocs/source/deployment/3_unified_hf.rstmodelopt/torch/export/plugins/mcore_common.pymodelopt/torch/export/plugins/mcore_qwen3vl.pytests/gpu_megatron/torch/export/test_mcore_qwen3vl.py
|
|
/claude review |
Claude Review SummarySmall, additive PR that clones the Qwen3 mcore mapping with Findings
Most impactful
Risk: Low-to-moderate. Code is purely additive, no existing arch behavior changes. Worst case is a broken Qwen3-VL round-trip that only manifests at runtime — which is exactly why the test placement matters. |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1482 +/- ##
==========================================
+ Coverage 76.79% 76.85% +0.06%
==========================================
Files 474 475 +1
Lines 51578 51592 +14
==========================================
+ Hits 39607 39650 +43
+ Misses 11971 11942 -29
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add Megatron Core export/import mapping for Qwen3-VL (Qwen3VLForConditionalGeneration). Handles the model.language_model. weight prefix and supports both dense and MoE variants. Signed-off-by: Hung-Yueh <hungyuehc@nvidia.com> mv test_mcore_qwen3vl.py to tests/gpu_megatron/torch/export/ Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
2423ae7 to
a7d1170
Compare
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
|
/claude review |
Claude Review SummaryFindings: CRITICAL: 0 · IMPORTANT: 0 · SUGGESTION: 2 The change is small and additive: it registers a new HF→MCore mapping for Notes
RiskLow. The change is isolated to an additive plugin entry and a new file; the round-trip has been validated by the author against |
Replace the hand-written dict literals in mcore_qwen3vl.py with a helper that derives the VL mapping from qwen3_causal_lm_import/export by inserting 'language_model.' after 'model.' in every prefix. lm_head. (root-level) is left unchanged. Remove TestQwen3VLvsQwen3Difference since it now tests the implementation against itself. Note visual encoder (model.visual.*) is intentionally excluded from the mapping. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
e520fe3 to
80495e6
Compare
IMPORTANT fixes: - Make Qwen3VLConfig / Qwen3VLForConditionalGeneration lazy imports inside get_tiny_qwen3vl() and _test_unified_export_megatron() so older transformers builds do not break collection of every test importing transformers_models.py - Drop unsupported MoE claim from CHANGELOG; only Qwen3VLForConditionalGeneration (dense) is registered in the mcore dispatch tables Suggestions: - Copy func_kwargs in _with_language_model_prefix to avoid shared-dict mutation between Qwen3 and Qwen3VL mapping entries - Add unquantized qwen3vl export test case alongside the FP8 one - Add torch.distributed.barrier() before the rank-0 vision-weight merge block to keep the code safe if TP > 1 is used later Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
|
/claude review |
| """Custom mapping from Qwen3-VL Hugging Face models to Megatron Core models. | ||
|
|
||
| Qwen3-VL differs from Qwen3 in one structural way: language-model weights live | ||
| under ``model.language_model.`` instead of ``model.``, while ``lm_head.weight`` | ||
| remains at the root level. The mappings below are derived automatically from | ||
| the Qwen3 mappings by inserting ``language_model.`` after ``model.`` for every | ||
| prefix that starts with ``model.``. | ||
|
|
||
| Note: the visual encoder (``model.visual.*``) is intentionally excluded — this | ||
| mapping covers only the language-model decoder used for quantization and export. | ||
|
|
||
| Reference: https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct/blob/main/model.safetensors.index.json | ||
| """ |
There was a problem hiding this comment.
[SUGGESTION] Scope-clarifying note: Qwen3-VL ships in two architectures — Qwen3VLForConditionalGeneration (dense) and Qwen3VLMoeForConditionalGeneration (MoE, e.g. Qwen/Qwen3-VL-30B-A3B-Instruct). This PR only registers the dense variant.
The MoE variant cannot reuse qwen3_causal_lm_export with a prefix rewrite because Qwen3-VL-MoE stores experts in fused form (mlp.experts.gate_up_proj/mlp.experts.down_proj as 3-D tensors) rather than the per-expert layout (mlp.experts.{}.down_proj) that qwen3_causal_lm_* assumes. So this is a real limitation, not just a missing registration.
Consider adding a one-line note to the module docstring (e.g. "Covers the dense Qwen3VL variant only; Qwen3VLMoeForConditionalGeneration uses a fused-expert layout and requires a separate mapping.") so that users hitting KeyError: 'Qwen3VLMoeForConditionalGeneration' in _populate_rule_book know it's intentional and what's missing.
| if model_type == "qwen3vl": | ||
| torch.distributed.barrier() | ||
| if model_type == "qwen3vl" and rank == 0: |
There was a problem hiding this comment.
[SUGGESTION] Asymmetric barrier: only the qwen3vl branch calls torch.distributed.barrier() and only rank 0 does the merge + load, but there's no second barrier after the rank-0 work. With the current dist_workers_size_1 (size=1) parametrization this is harmless, but if this test is ever extended to multi-rank size, non-zero ranks will exit _test_unified_export_megatron while rank 0 is still loading the merged checkpoint, and tmp_path may be torn down by the fixture before rank 0 finishes. A trailing torch.distributed.barrier() at the end of the rank-0 block (outside the rank == 0 guard) would future-proof this.
| prefix = m.target_name_or_prefix | ||
| if prefix.startswith("model."): | ||
| prefix = "model.language_model." + prefix[len("model.") :] | ||
| result[key] = type(m)(target_name_or_prefix=prefix, func_kwargs=dict(m.func_kwargs)) |
There was a problem hiding this comment.
[SUGGESTION] dict(m.func_kwargs) is a shallow copy — the inner parallel_config ParallelConfig instance is still shared with the source Qwen3 mapping. That's fine today because ParallelConfig is treated as immutable by all consumers, but it means any future code that mutates func_kwargs["parallel_config"] on a Qwen3VL entry would silently bleed into the Qwen3 entry as well. Since the cost is zero, copy.deepcopy(m.func_kwargs) would be more defensive.
There was a problem hiding this comment.
Claude review — LGTM with minor suggestions.
Findings: CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 3 (all inline)
Summary of review:
- The mapping derivation is correct. Verified against Qwen/Qwen3-VL-8B-Instruct's
model.safetensors.index.json: language-model weights live undermodel.language_model.*, vision encoder undermodel.visual.*, andlm_head.weightis at root — which matches the prefix-rewrite logic exactly. - The string-based prefix rewrite is currently safe because every
model.-prefixed entry inqwen3_causal_lm_{import,export}is a language-model component andlm_head.(root) is left untouched. unified_export_megatron.pyalready handles multimodal configs by reading fromtext_config(line 146), so the new arch slots in cleanly without further plumbing.- Test changes correctly thread
text_configthrough the export path and merge in the vision tower for the round-trip from_pretrained sanity check.
Inline suggestions (non-blocking):
Qwen3VLMoeForConditionalGenerationis unsupported — the MoE variant uses a fused expert layout (mlp.experts.gate_up_proj/mlp.experts.down_projas 3-D tensors) and can't reuse the dense Qwen3 mapping. A one-line scope note in the module docstring would save users a confusingKeyError.- Asymmetric barrier in the
qwen3vltest branch — currently fine underdist_workers_size_1but would race if the test is ever extended to multi-rank. dict(m.func_kwargs)is a shallow copy;copy.deepcopywould be more defensive against future mutation.
Risk assessment: Low. Additive change, dense Qwen3-VL only, no impact on existing model paths.
- Add docstring note that Qwen3VLMoeForConditionalGeneration is unsupported (MoE variant uses 3-D fused expert tensors incompatible with dense mapping) - Replace dict(m.func_kwargs) with copy.deepcopy for full isolation of copied mapping entries - Fix asymmetric barrier: fold rank==0 check under the barrier so all ranks participate before any rank writes vision weights Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
|
/claude review |
|
Claude review summary Findings: CRITICAL 0, IMPORTANT 0, SUGGESTION 2 Highlights:
Inline suggestions:
Risk: low. Additive plugin, well-contained, with unit and manual integration test coverage. Both findings are non-blocking suggestions. |
| def _merge_vision_weights(src_dir: Path, dst_dir: Path) -> None: | ||
| """Copy model.visual.* tensors from src safetensors into dst_dir. | ||
|
|
||
| The mcore export only writes language-model weights. To produce a complete | ||
| Qwen3-VL checkpoint the vision-encoder weights must be merged from the | ||
| original pretrained checkpoint. | ||
| """ | ||
| vision_tensors = {} | ||
| for sf in sorted(src_dir.glob("*.safetensors")): | ||
| with safe_open(str(sf), framework="pt", device="cpu") as f: | ||
| for key in f.keys(): # noqa: SIM118 | ||
| if key.startswith("model.visual."): | ||
| vision_tensors[key] = f.get_tensor(key) | ||
| if vision_tensors: | ||
| save_file(vision_tensors, str(dst_dir / "model-vision.safetensors")) |
There was a problem hiding this comment.
[SUGGESTION] _merge_vision_weights writes model-vision.safetensors but does not update model.safetensors.index.json. The exporter writes the index in save_safetensors_by_layer_index / save_safetensors, and Qwen3VLForConditionalGeneration.from_pretrained uses that index when one is present — any safetensors file not referenced by the index is ignored.
Net effect on this test:
- The
any(k.startswith("model.visual.") for k in keys)assertion only checks that the file exists on disk, not that it is discoverable byfrom_pretrained. - Because the forward pass below uses
input_idsonly (nopixel_values), the vision encoder is never exercised, so randomly-initialised vision weights would not cause a failure.
So the test claims to validate a complete Qwen3-VL checkpoint round-trip, but in practice it only validates the language-model portion. Consider either (a) rewriting the index file after merging vision weights and verifying vl_model.model.visual.patch_embed.proj.weight (or similar) round-trips bit-exact, or (b) tightening the docstring/assertions to make clear this is a language-model-only round-trip and the vision merge is a placeholder for the manual workflow.
| mapping: dict[str, CustomModuleMapping], | ||
| ) -> dict[str, CustomModuleMapping]: | ||
| """Derive a VL mapping from a base Qwen3 mapping. | ||
|
|
||
| Rewrites every ``target_name_or_prefix`` that starts with ``model.`` to | ||
| ``model.language_model.<rest>``. Prefixes that do not start with | ||
| ``model.`` (e.g. ``lm_head.``) are left unchanged. | ||
| """ | ||
| result = {} | ||
| for key, m in mapping.items(): | ||
| prefix = m.target_name_or_prefix | ||
| if prefix.startswith("model."): | ||
| prefix = "model.language_model." + prefix[len("model.") :] | ||
| result[key] = type(m)( | ||
| target_name_or_prefix=prefix, func_kwargs=copy.deepcopy(m.func_kwargs) | ||
| ) | ||
| return result | ||
|
|
There was a problem hiding this comment.
[SUGGESTION] Reconstructing each mapping via type(m)(target_name_or_prefix=..., func_kwargs=...) works today only because every CustomModuleMapping subclass in mcore_custom.py happens to share the same (target_name_or_prefix, func_kwargs) init signature. If a future subclass adds another required constructor arg (or changes the keyword names), this loop will silently break or drop information.
A more robust pattern is to deep-copy the original mapping and just rewrite the prefix:
def _with_language_model_prefix(
mapping: dict[str, CustomModuleMapping],
) -> dict[str, CustomModuleMapping]:
result = {}
for key, m in mapping.items():
new_m = copy.deepcopy(m)
if new_m.target_name_or_prefix.startswith("model."):
new_m.target_name_or_prefix = (
"model.language_model." + new_m.target_name_or_prefix[len("model.") :]
)
result[key] = new_m
return resultThis preserves func_name and any future fields automatically and removes the implicit coupling to subclass __init__ signatures.
|
/ok to test e8e2d7b |
|
/ok to test 1a86b05 |
There was a problem hiding this comment.
how long does running all these tests take? It should be printed at the end of pytest command so we can see if these tests are too slow or fine
There was a problem hiding this comment.
Yes, it is printed at the end
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-9.0.3, pluggy-1.6.0 -- /opt/venv/bin/python
cachedir: .pytest_cache
hypothesis profile 'default' -> database=InMemoryExampleDatabase({})
rootdir: /workspace/modules/Model-Optimizer
configfile: pyproject.toml
plugins: hydra-core-1.3.2, anyio-4.9.0, timeout-2.4.0, mock-3.15.1, langsmith-0.7.30, typeguard-4.5.0, rerunfailures-16.1, xdist-3.8.0, shard-0.1.2, flakefinder-1.1.0, hypothesis-6.130.8
collecting ... collected 18 items
Running 18 items in this shard: tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[nemotron-None-None-None], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[nemotron-None-NVFP4_DEFAULT_CFG-None], tests/gpu_megatron/torch/export/test_unified_export_megatron.p
y::test_unified_export_megatron[nemotron-None-NVFP4_DEFAULT_CFG-FP8_KV_CFG], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[nemotron-eagle-None-None], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[nemotron-medusa-None-None], tests/gpu_megatron/torch/ex
port/test_unified_export_megatron.py::test_unified_export_megatron[llama-None-None-None], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[llama-None-FP8_DEFAULT_CFG-None], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[llama-None-FP8_DEFAULT_CFG-FP8_KV_C
FG], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[llama-eagle-None-None], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[llama-medusa-None-None], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[qwen3vl-Non
e-None-None], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[qwen3vl-None-FP8_DEFAULT_CFG-None], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_import_megatron[llama], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_import_megatron[qwen3vl
], tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_qkv_slicing_gqa_tp2, tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_mtp_state_dict_single_safetensors, tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_mtp_state_dict_no_mtp_keys, tests/gpu_megatron/torch/export/test_unified_expor
t_megatron.py::test_mtp_state_dict_index_file
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[nemotron-None-None-None] PASSED [ 5%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[nemotron-None-NVFP4_DEFAULT_CFG-None] PASSED [ 11%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[nemotron-None-NVFP4_DEFAULT_CFG-FP8_KV_CFG] PASSED [ 16%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[nemotron-eagle-None-None] PASSED [ 22%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[nemotron-medusa-None-None] PASSED [ 27%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[llama-None-None-None] PASSED [ 33%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[llama-None-FP8_DEFAULT_CFG-None] PASSED [ 38%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[llama-None-FP8_DEFAULT_CFG-FP8_KV_CFG] PASSED [ 44%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[llama-eagle-None-None] PASSED [ 50%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[llama-medusa-None-None] PASSED [ 55%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[qwen3vl-None-None-None] PASSED [ 61%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_export_megatron[qwen3vl-None-FP8_DEFAULT_CFG-None] PASSED [ 66%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_import_megatron[llama] PASSED [ 72%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_unified_import_megatron[qwen3vl] PASSED [ 77%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_qkv_slicing_gqa_tp2 PASSED [ 83%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_mtp_state_dict_single_safetensors PASSED [ 88%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_mtp_state_dict_no_mtp_keys PASSED [ 94%]
tests/gpu_megatron/torch/export/test_unified_export_megatron.py::test_mtp_state_dict_index_file PASSED [100%]
=============================== warnings summary ===============================
...
...
================== 18 passed, 30 warnings in 72.11s (0:01:12) ==================There was a problem hiding this comment.
Can you look at following section in the pytest output
============================= slowest 50 durations =============================
...
There was a problem hiding this comment.
Actually, 72.11s for all tests seems reasonable
Per reviewer feedback: the imports are now protected by the module-level
pytest.importorskip("transformers") guard (transformers_models.py) and by the
gpu_megatron test environment assumption (test file), so top-level imports are
the right style rather than lazy per-call imports.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
…nto hungyueh/pr-895
Signed-off-by: hychiang <hungyuehc@nvidia.com>
Top-level import of transformers.models.qwen3_vl fails on older transformers builds that don't have the submodule, breaking collection of every test that imports transformers_models.py. Use try/except so the module loads cleanly and get_tiny_qwen3vl() calls pytest.skip() when the classes are unavailable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
This reverts commit 10038f0. Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
This reverts commit 63a229a. Signed-off-by: Hung-Yueh Chiang <hungyuehc@nvidia.com>
4002b5a to
f56e4c2
Compare
|
\ok to test 74019e3 |
This PR is duplicated from PR #895. Since the original branch source is not available now, we create a new branch where we can update this PR.
What does this PR do?
new feature:
Overview: Add Qwen3-VL (Vision-Language) model support to the Megatron Core export/import
plugin, enabling HuggingFace-to-mcore weight conversion for PTQ/QAT/QAD workflows
Details
Qwen3-VL has a different weight structure from Qwen3 text-only models:
This PR adds:
Qwen3VLForConditionalGeneration and Megatron Core, handling the language_model prefix for
all decoder layers, QKV merging/slicing, gated MLP merging/slicing, Q/K layer norms.
all_mcore_hf_export_mapping and all_mcore_hf_import_mapping.
Usage
From the comment:
Create
Megatron-LM/examples/post_training/modelopt/conf/Qwen/Qwen3-VL-8B-Instruct.sh:Then, import Qwen3-VL from HuggingFace to MCore:
Testing
covering:
prefix, lm_head. at root, QKVMerging, GatedMLPMerging, REPLICATE
for layernorms, TP sharding configs
parallel_config
prefixes
language_model. prefix, lm_head unchanged
Before your PR is "Ready for review"
tests/gpu_megatron/torch/export/test_mcore_qwen3vl.pydocs/source/deployment/3_unified_hf.rstCHANGELOG.rstAdditional Information
Companion Megatron-LM PR adds Qwen3VLModel, Qwen3VLDataset, and pretrain_qwenvl.py. Please see this PR NVIDIA/Megatron-LM#3444
Summary by CodeRabbit
New Features
Documentation
Tests